Abstract: Given a partially observable Markov decision 
process (POMDP) with finite state, input and measurement 
spaces and costly measurements and control, we consider the 
problem of when to sample and actuate. Both sampling and 
actuation are modeled as control actions in a framework that 
encompasses both estimation and intervention problems. The 
process evolves freely, only driven by disturbances, in between 
two consecutive control actions. Control actions are assumed 
to reset the conditional distribution of the state given the 
measurements to one of a finite number of distributions. We 
tackle the problem of finding the times at which these control 
actions should take place in order to minimize an average cost 
that penalizes states and the number of control actions. The 
problem is first shown to boil down to a stopping time problem. 
While the latter can be solved optimally, the complexity of the 
optimal policy is intractable. Thus, we propose two approximate 
methods. The first is inspired by relaxed dynamic programming 
and it is within an additive cost factor of the optimal policy. 
The second is inspired by consistent event-triggered control and 
ensures that the cost is smaller than that of periodic control for 
the same control rate. We conclude that the latter policy can 
deal with large dimension problems, as demonstrated in the 
context of precision farming. In such a context, the proposed 
policies rely on a limited number of remote sensors indicating 
the presence of weed to decide the timings of weed removal. 


I. INTRODUCTION 


There are many control and estimation applications where 
taking actions or intervening in an otherwise freely evolving 
process is costly but necessary. Thus, the times of these 
interventions must be carefully selected, possibly based 
on available process information. For instance, in remote 
monitoring limited sensors can work as a proxy to detect 
events that need to be confirmed by closer inspection and 
eventually handled. This is the case in precision farming, 
where diseases, water stress, nitrogen levels, or weed can be 
inferred, to some extent, from a few sensors placed in the 
field. However, costly farmer or (aerial) robot inspections are 
required for full situational awareness and handling. See [1], 
[2] where when to irrigate depends on soil moisture and 
temperature sensors, and [3] where when to apply nitrogen 
fertilizers depends on in-situ active-light reflectance
measurements. Similar challenges arise in queuing control [4], 
[5], predictive maintenance [6], [7], stock trading [8], home 
surveillance [9], and disease outbreaks [10]. 

These and related problems have been studied in several 
fields mostly considering finite (or countable) state, input and 
measurement spaces, leading to partially observable Markov 
decision processes (POMDPs) [4]-[8]. It is well-known that 
these problems are intractable and thus finding appropriate 
(close to optimal) strategies remains an important challenge. 
In turn, over the past two decades, much research has 
been carried out in the area of event-triggered control [11] 
that tackles the related problem of choosing the times to 
sample or actuate in a control loop based on states or 
events rather than the usually considered periodic time- 
triggered sampling and control strategies. In event-triggered 
control, state, input and measurement spaces are typically 
continuous, and therefore the literature that does consider 
finite spaces is scarce. A line of related research, proposed 
in [12], [13] and [14], aims at finding control policies for 
POMDPs considering finite spaces that depend on events, 
defined in terms of given state transitions; events are
predefined, whereas in many event-triggered control papers, as 
in the present paper, events are also to be scheduled as a 
function of the process information based on an optimal 
criterion [15]-[18]. This latter approach is followed in a 
different research line proposed in [19], [20] and [21], also 
considering POMDPs with finite spaces. However, [19], [20] 
propose to decide the next sampling or control time based 
on the information up to the current sampling or control 
time and not in between the two. This parallels self-triggered 
control [11], which differs from event-triggered control. In 
fact, event-triggered control continuously monitors the state 
or output of the process to decide the next sampling or 
control time [11], which is the approach taken here. 

To be more precise, the present paper considers POMDPs 
with finite state, input, and measurement spaces. Control 
actions are used to model both costly sampling through 
information gathering and costly actuation through process 
intervention. Since these are costly, control actions only take 
place at some discrete times and the process evolves freely 
in between two consecutive control actions, possibly with 
inputs computed at control action times. Control actions are 
assumed to reset the conditional distribution of the state given 
the measurements to one of a finite number of distributions. 
In this sense, the effect of control actions is known and only 
when control actions should be enforced is to be determined. 
In fact, we tackle the problem of finding the times at which 
these control actions should take place in order to minimize 
an average cost that additively penalizes state configurations 
and the number of control actions. We show that the average 
cost problem can be tackled as a stopping time problem. 
While for this latter problem, an optimal policy can be 
obtained, its complexity is intractable. This motivates us to 
propose two classes of approximate policies. 

First, we propose a class of policies inspired by relaxed 
dynamic programming, see [22]. These policies guarantee 
a cost within an additive constant factor of the cost of 
the optimal policy, rather than a multiplicative factor as in 
original relaxed dynamic programming [22]. As explained in 
the sequel, this is needed for the stopping time problem at 
hand since the cost can be negative. While the complexity 
of the approximate policy is by far smaller than that of the 
optimal policy, and it provides nearly optimal results when 
the state dimension is small, it becomes impractical when 
the state dimension is large (see example in Section VI). 

Second, inspired by [17], [18], we propose a class of 
so-called consistent policies that lead to a strictly smaller 
cost than that of periodic inspection for the same average 
inspection rate. The policies proposed here are different from 
the ones in [17], [18], both in form and in terms of their 
derivation, to account for the case that the state probability 
distribution resets to one of a set of possible distributions 
rather than a single one, which is a crucial assumption to 
capture important applications such as remote estimation. 

The applicability of the results is highlighted by a
numerical case study in the context of precision farming. The 
proposed policies rely on limited remote sensors indicating 
the presence of weed to decide the timings of weed removal. 
Due to space discretization the state dimension is rather 
large. Differently from relaxed dynamic programming, the 
policy inspired by consistent event-triggered control can 
handle problems with a large state dimension. This shows 
that ideas inspired by the event-triggered control literature 
can be useful to determine the times to sample and control 
a POMDP. 

The remainder of the paper is organized as follows. 
Section II provides the problem formulation and some
applications that can be tackled within this framework. Section III 
provides the optimal policy and explains the intractability 
issue. Sections IV and V provide the approximate policies 
and main results, based on relaxed dynamic programming 
and the consistent policies, respectively. Simulation results 
are discussed in Section VI and concluding remarks in 
Section VII. Due to space limitations the proofs of the results 
are omitted but can be found in the appendix. 


II. PROBLEM FORMULATION 


Consider a dynamical system with discrete state, input, 
and measurement spaces 


$$x_{t+1} = f(x_t, u_t, w_t), \qquad y_t = h(x_t, u_t, v_t), \tag{1}$$

where $x_t \in \{1,2,\dots,n\}$, $u_t \in \{1,2,\dots,n_u\}$, $y_t \in \{1,2,\dots,n_y\}$, $w_t \in \{1,2,\dots,n_w\}$, $v_t \in \{1,2,\dots,n_v\}$ are the state, control input, measurement, process disturbance input, and measurement noise input at time $t \in \mathbb{N}_0 := \mathbb{N} \cup \{0\}$, respectively. The disturbance sequences $\{w_t \mid t \in \mathbb{N}_0\}$ and $\{v_t \mid t \in \mathbb{N}_0\}$ are assumed to be independent and identically distributed (i.i.d.) and mutually independent. The fact that the output equation in (1) also depends on the control input $u_t$ makes it possible to tackle sensor management problems [23]. Consider also an average cost given by

$$J_s = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[\sum_{t=0}^{T-1} g(x_t, u_t)\Big]. \tag{2}$$

Proper (ergodicity) assumptions will be given in the sequel for this limit to exist (see Assumption 2). If $u_t$ has no restrictions then this is a standard infinite-horizon POMDP. However, in our formulation selecting or modifying $u_t$ is costly and we impose appropriate restrictions. 

Section II-A provides such a formulation and states the 
problem. Section II-B provides some examples of
applications that are encompassed by the proposed framework. 


A. Formulation with costly control rate and problem statement 


The control rate is assumed to be costly and thus $u_t$ can only be decided upon at so-called control times $s_\ell \in \mathbb{N}_0$, 

$$s_{\ell+1} = s_\ell + \tau_\ell, \tag{3}$$

with $\tau_\ell \in \mathbb{N}$. For convenience, the intervals between control times are assumed to be bounded 

$$\tau_\ell \leq h, \quad \forall \ell \in \mathbb{N}_0. \tag{4}$$

Since $h \in \mathbb{N}$ can be arbitrarily large, this assumption is not very restrictive in practice. We introduce a binary control variable to indicate the time decisions of control actions 

$$\sigma_t = 1, \ \text{if } s_\ell = t \text{ for some } \ell,$$

and $\sigma_t = 0$ otherwise. We assume that $s_0 = 0$ and, thus, $\sigma_0 = 1$. Note that $\{t \in \mathbb{N}_0 \mid \sigma_t = 1\} = \{s_\ell \mid \ell \in \mathbb{N}_0\}$. At times $t = s_\ell$, a sequence of current and future inputs $U_{0|t}, U_{1|t}, \dots, U_{h-1|t}$ is computed to be applied to the system in between control times 

$$u_{t+j} = U_{j|t}, \ \text{for } j \in \{0,\dots,\tau_\ell - 1\} \text{ when } t = s_\ell \text{ for some } \ell.$$

Often there is one decision $u_t = \bar u \in \{1,\dots,n_u\}$ that corresponds to a free (not controlled) mode of the system and $U_{j|s_\ell} = \bar u$ for $j \in \{1,\dots,\tau_\ell - 1\}$. However, more general cases can be considered provided that the $U_{j|s_\ell}$ satisfy Assumption 3 below. 
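
As a minimal illustration of the bookkeeping in (3) and of the indicator $\sigma_t$, the sketch below builds the control times from a given list of inter-control times. The function name `control_schedule` and the returned "elapsed time since the last control action" list are illustrative choices, not notation fixed by the paper.

```python
def control_schedule(taus, horizon):
    """Build control times s_l, the indicator sigma_t and the elapsed time
    since the last control action for t = 0, ..., horizon - 1.

    `taus` is a hypothetical list of inter-control times tau_0, tau_1, ...
    """
    s = [0]  # s_0 = 0, hence sigma_0 = 1
    for tau in taus:
        if tau < 1:
            raise ValueError("inter-control times must be positive integers")
        s.append(s[-1] + tau)  # s_{l+1} = s_l + tau_l, cf. (3)
    control_times = set(s)
    sigma, elapsed = [], []
    last = 0
    for t in range(horizon):
        if t in control_times:
            last = t
        sigma.append(1 if t in control_times else 0)
        elapsed.append(t - last)
    return s, sigma, elapsed


s, sigma, elapsed = control_schedule([3, 2, 4], horizon=9)
```

With these inputs the control times are 0, 3, 5 and 9, and `sigma` equals one exactly at the control times that fall inside the simulated horizon.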
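Let $p_0$ denote the initial state probability distribution 

$$p_0 = [p_{0,1}\ \ p_{0,2}\ \ \dots\ \ p_{0,n}]^\top$$

with $p_{0,i} = \mathrm{Prob}[x_0 = i]$ and $p_0 \in \mathcal{P}_n := \{p = [p_1 \ \dots\ p_n]^\top \mid 1_n^\top p = 1,\ p_i \geq 0,\ \forall i\}$, where $1_n$ denotes a column vector with $n$ entries equal to one. Let also $\mathcal{I}_t = \mathcal{I}_{t-1} \cup \{y_t, \sigma_{t-1}, u_{t-1}\}$ for $t \in \mathbb{N}$, with $\mathcal{I}_0 = \{p_0\} \cup \{y_0\}$, denote the information available for decisions up to time $t$ and 

$$p_{t|t} = [p_{t|t,1}\ \ p_{t|t,2}\ \ \dots\ \ p_{t|t,n}]^\top$$

denote the probability distribution of the state $x_t$ given the information set $\mathcal{I}_t$, i.e., 

$$p_{t|t,i} = \mathrm{Prob}[x_t = i \mid \mathcal{I}_t],$$

with $p_{t|t} \in \mathcal{P}_n$. Let $p_{t+1|t}$ be defined similarly but with $p_{t+1|t,i} = \mathrm{Prob}[x_{t+1} = i \mid \mathcal{I}_t]$. 

A crucial assumption is that either $p_{t|t}$ or $p_{t+1|t}$ belongs to a known set of $b$ possible distributions, denoted by $\bar p_1, \dots, \bar p_b$, when actuation is computed (at control times). Formally: 

Assumption 1: One of the following conditions holds, for every $\ell \in \mathbb{N}_0$, 

(i) $p_{s_\ell|s_\ell} \in \{\bar p_1, \dots, \bar p_b\}$, (5a) 
(ii) $p_{s_\ell+1|s_\ell} \in \{\bar p_1, \dots, \bar p_b\}$. (5b) 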













Assumption 1(i) captures applications where the state be- 
comes either known or has a known probability distribution 
at time ¢ (through map f) when a control intervention on 
the process is carried out at time t, while Assumption 1 (ii) 
captures applications where the control actions at time t 
influence measurements at time ¢ through map h. Two 
examples are given in Section II-B. 

Let $\phi_\ell \in \{1,\dots,b\}$ be such that $p_{s_\ell|s_\ell} = \bar p_{\phi_\ell}$, or $p_{s_\ell+1|s_\ell} = \bar p_{\phi_\ell}$, when Assumptions 1(i), 1(ii) hold, respectively (if both hold either choice for $\phi_\ell$ can be picked). We can define a Markov chain with $b$ states and transition probability matrix $R$ with entries $R_{ij} = \mathrm{Prob}[\phi_{\ell+1} = i \mid \phi_\ell = j]$. This Markov chain is assumed to be ergodic, meaning that it is aperiodic and irreducible. Thus, it has a stationary probability distribution. 

Assumption 2: There exists a unique $\alpha \in \mathcal{P}_b$ such that $\alpha = R\alpha$. 














Due to this assumption and since we are interested in an average cost (2) we can without loss of generality assume that the initial distribution of $\phi_0$ is $\alpha$, that is $\mathrm{Prob}[\phi_0 = i] = \alpha_i$, $i \in \{1,\dots,b\}$. 

A third assumption imposes that the control sequence $U_{j|s_\ell}$ only depends on $\phi_\ell$. 

Assumption 3: $U_{j|s_\ell} = \theta(j, \phi_\ell)$ for every $j \in \{0,\dots,h-1\}$, every $\ell \in \mathbb{N}_0$, and for some function $\theta$. 














These assumptions are motivated by and met in the applications discussed in the sequel. 
Note that due to these assumptions in between control times the system evolves freely according to 

$$x_{t+1} = \bar f(x_t, \varsigma_t, \phi_\ell, w_t), \quad s_\ell \leq t \leq s_{\ell+1} - 1,$$

where 

$$\varsigma_t = t - s_{L(t)}, \quad L(t) = \max\{\ell \mid s_\ell \leq t\},$$

and $\bar f(x_t, \varsigma_t, \phi_\ell, w_t) := f(x_t, \theta(\varsigma_t, \phi_\ell), w_t)$. Likewise, the average cost can be written as 

$$J_s = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[\sum_{\ell=0}^{L(T)-1} \sum_{t=s_\ell}^{s_{\ell+1}-1} \bar g(x_t, \varsigma_t, \phi_\ell) + \sum_{t=s_{L(T)}}^{T-1} \bar g(x_t, \varsigma_t, \phi_{L(T)})\Big] \tag{6}$$

where $L(T) = \max\{\ell \mid s_\ell \leq T - 1\}$, $\bar g(x_t, \varsigma_t, \phi_\ell) = g(x_t, \theta(\varsigma_t, \phi_\ell))$, and the output as 

$$y_t = \bar h(x_t, \varsigma_t, \phi_\ell, v_t), \quad s_\ell \leq t \leq s_{\ell+1} - 1,$$

where $\bar h(x_t, \varsigma_t, \phi_\ell, v_t) = h(x_t, \theta(\varsigma_t, \phi_\ell), v_t)$. 

Fig. 1: Remote estimation: The quantized state of a process is sent through a costly link to a remote estimator that computes the quantized state distribution given the previously received states, denoted by $p_{t|t}$, and computes an estimate $\hat\xi_t$ of $\xi_t$ according to (10). 

The fact that the actuation is costly is handled by defining a cost that penalizes the average rate of control actions $J_c := \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}[\sigma_t]$. Then, one can see the minimization of 

$$J_{\mathrm{av}} = J_s + \delta J_c \tag{7}$$

over different values of $\delta$ as finding Pareto optimal policies. Another interpretation for the cost $J_{\mathrm{av}}$ is that there is an actual actuation cost $\delta$, noticing that the running cost for $J_{\mathrm{av}}$ is $g(x_t, u_t) + \delta\sigma_t$. The goal is to find a policy 

$$\sigma_t = \mu_t(\mathcal{I}_t) \tag{8}$$

for $t \in \mathbb{N}_0$, that minimizes $J_{\mathrm{av}}$. 


B. Examples of applications 


1) Remote estimation with costly transmission: Consider a process described by a special case of (1) that does not depend on a control input and for which the full state is available 

$$y_t = x_t. \tag{9}$$

The dynamical model can equivalently be described by a transition matrix $P$ with entries $P_{ij} = \mathrm{Prob}[x_{t+1} = i \mid x_t = j]$, assumed to be ergodic. Model (9) results from quantizing a process $\xi_{t+1} = a(\xi_t, \omega_t)$ with real vector-valued $\xi_t$ and i.i.d. disturbances $\omega_t$, and state $x_t$ labels one of $n$ representative values $\bar\xi_i$ of the quantized state variable $\xi_t$. This is illustrated in Figure 1. Thus, there exists a labelling map $\pi$ such that $\bar\xi_i = \pi(i)$. The state is known to an agent who wishes to send it to a remote estimator. Control times $s_\ell$ are understood here in a broad sense as the times at which information is sent to the remote estimator, and, as before, are indicated by $\sigma_t \in \{0,1\}$ determining when information is sent ($\sigma_t = 1$) or not ($\sigma_t = 0$). Transmissions are assumed to be expensive, e.g., due to battery limitations on the sensor side. On the remote side, an estimator obtains $q_t = [q_{t,1}\ \dots\ q_{t,n}]^\top$, with $q_{t,i} = \mathrm{Prob}[x_t = i \mid \mathcal{I}_t]$, $\mathcal{I}_t := \{y_\ell \mid 0 \leq \ell \leq t,\ \sigma_\ell = 1\}$ and $q_t \in \mathcal{P}_n$, and computes 

$$\hat\xi_t = \mathbb{E}[\pi(x_t) \mid \mathcal{I}_t] = \sum_{i=1}^{n} \pi(i)\, q_{t,i}. \tag{10}$$

Note that, assuming $\sigma_0 = 1$, 

$$q_t = \begin{cases} \delta_{x_t}, & \text{if } \sigma_t = 1, \\ P^{\,t - s_\ell}\, \delta_{x_{s_\ell}}, & \text{if } \sigma_t = 0, \end{cases} \tag{11}$$

where $s_\ell$ is the latest control time before $t$ and $\delta_i$ is the column vector of zeros except at position $i$ where it equals 1. The running cost of an average cost penalizes through the Euclidean norm the difference between the state and the estimated state 

$$\|\pi(x_t) - \hat\xi_t\|^2. \tag{12}$$

While (9) does not depend on the control input $u_t$, we can use the control input as a modeling variable to ensure we can write (2) with running cost (12). In fact, we can define 

$$u_{1,t} = x_{s_\ell}, \quad u_{2,t} = \varsigma_t, \quad \text{for } s_\ell \leq t \leq s_{\ell+1} - 1,$$

and $u_t \in \{1,\dots,n_u\}$, with $n_u = nh$, to assign a unique label to the pair $(u_{1,t}, u_{2,t}) \in \{1,\dots,n\} \times \{1,\dots,h\}$. Then (12) can be written as $g(x_t, u_t)$ and (10), (11) are also functions of the state and control input since, when $\sigma_t = 0$, $q_t = P^{u_{2,t}} \delta_{u_{1,t}}$, and when $\sigma_t = 1$, $q_t = \delta_{x_t}$. Moreover, $p_{s_\ell|s_\ell} = \delta_{x_{s_\ell}} \in \{\delta_1, \dots, \delta_n\}$ belongs to a finite set, $U_{j|s_\ell}$ are functions of $\phi_\ell = x_{s_\ell}$ and $j$, and ergodicity of $R$ follows from ergodicity of $P$. The goal is to find a policy (8) for when to apply control actions (send remote data) in order to minimize (7) where the running cost in $J_s$ is given by (12). 
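
The remote estimator's computation between transmissions can be sketched as follows: the last received state is propagated through the transition matrix and the estimate is the expectation of the representative values under the propagated distribution. This is only an illustrative sketch; the names `predict` and `estimate`, the zero-based state labels, the scalar representative values and the example matrix are assumptions made here, not taken from the paper.

```python
def predict(P, x_last, steps):
    """Distribution of the state `steps` time units after the state x_last
    was received, i.e. P**steps applied to the unit vector at x_last.

    P[i][j] = Prob[x_{t+1} = i | x_t = j] (column-stochastic), states 0-based.
    """
    n = len(P)
    q = [0.0] * n
    q[x_last] = 1.0
    for _ in range(steps):
        q = [sum(P[i][j] * q[j] for j in range(n)) for i in range(n)]
    return q


def estimate(q, values):
    """Expected representative value under the distribution q (scalar case)."""
    return sum(v * p for v, p in zip(values, q))


# Hypothetical two-state chain with representative values -1 and +1.
P_example = [[0.9, 0.2],
             [0.1, 0.8]]
q_now = predict(P_example, x_last=0, steps=0)
q_next = predict(P_example, x_last=0, steps=1)
xi_hat_next = estimate(q_next, [-1.0, 1.0])
```

Right after a transmission the distribution is a unit vector, so the estimate equals the transmitted value; the longer the estimator runs open loop, the more the distribution spreads and the larger the expected squared error becomes.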

2) Costly interventions based on limited data: Consider a set of $N$ discrete interdependent states $x_{i,t} \in \{1,\dots,n_i\}$ evolving in a free, or non-controlled, fashion according to 

$$x_{i,t+1} = f_i(x_{1,t}, \dots, x_{N,t}, w_{i,t}), \quad i \in \{1,\dots,N\}, \tag{13}$$

when there are no control interventions ($\sigma_t = 0$), where the disturbance inputs $w_{i,t}$ live in a finite set. At control or intervention times ($\sigma_t = 1$) the state is reset to 

$$x_{1,t+1} = a_1, \quad x_{2,t+1} = a_2, \quad \dots, \quad x_{N,t+1} = a_N \tag{14}$$

and a fixed cost $\delta$ is paid for each intervention. Only a subset of states is measured by $M$ sensors. Each sensor $j \in \{1,\dots,M\}$ depends on a subset of $n_j$ states $L_j = \{l_1, \dots, l_{n_j}\}$ with $l_i \in \{1,\dots,N\}$, 

$$y_{j,t} = h_j(x_{l_1,t}, \dots, x_{l_{n_j},t}, v_{j,t}) \tag{15}$$

where $j \in \{1,\dots,M\}$, and where the measurements $y_{j,t} \in \{1,\dots,n_{y,j}\}$ might be corrupted by noise $v_{j,t}$, also living in a finite set. The running cost of an average cost is given by 

$$g_A(x_{1,t}, \dots, x_{N,t}) = \sum_{i=1}^{N} g_i(x_{i,t}). \tag{16}$$

In the context of precision agriculture (see Section VI for more details), each $x_{i,t}$ might be a binary variable representing if there is weed ($x_{i,t} = 2$) in a given subarea of a field at time $t$ or not ($x_{i,t} = 1$); $x_{i,t}$ depends on neighboring states and only a subset of subareas can be measured. At intervention times $s_\ell$ the weed is completely removed, setting all the states to $x_{i,s_\ell+1} = 1$, $i \in \{1,\dots,N\}$. For some constant $d$, representing the cost of having weed between $t$ and $t+1$, 

$$g_i(2) = d \ \text{ and } \ g_i(1) = 0, \quad \forall i \in \{1,\dots,N\}. \tag{17}$$

We can find single variables $x_t \in \{1,\dots,n\}$, $y_t \in \{1,\dots,n_y\}$, $n = n_1 \times \dots \times n_N$, $n_y = n_{y,1} \times \dots \times n_{y,M}$, to label all possible state and output combinations and write the problem in the canonical form described above, see Section VI. Here $p_{s_\ell+1|s_\ell} = \delta_a$, corresponding to $x_{s_\ell+1} = a$, where $a \in \{1,\dots,n\}$ is the label corresponding to (14), and Assumption 1 is trivially met since $R$ corresponds to a Markov chain with just one state. The control input $u_t$ can be used to model the process evolution and in particular the state reset (14), but it can also be omitted. The goal is to find a policy for $\sigma_t$ as a function of the information set $\mathcal{I}_t = \mathcal{I}_{t-1} \cup \{y_t, \sigma_{t-1}\}$ for $t \in \mathbb{N}$, with $\mathcal{I}_0 = \{y_0\}$, to minimize the average cost (7) with the proposed running cost. 


III. OPTIMAL POLICY 


We start by providing a result that allows for simplifying the problem from an average cost problem to a stopping time problem. The stopping time problem is the following: consider system (1), which now needs to be considered only in the interval $t \in \{0,1,\dots,h\}$. The information available to make a decision at time $t \in \{1,\dots,h\}$ on whether $s_1 = \tau_0 \in \{1,\dots,h\}$ is equal to $t$ or larger is summarized in $p_{0|0} = \bar p_{\phi_0}$, if (5a) holds, or in $p_{1|0} = \bar p_{\phi_0}$, if (5b) holds, and in the measurements $y_1, y_2, \dots, y_t$. Thus, we define the information set 

$$\mathcal{H}_t^0 = \{\phi_0\} \cup \{y_\ell \mid \ell \in \{1,\dots,t\}\}.$$

Consider now the following optimal stopping time problem, first for $\ell = 0$ and $s_0 = 0$: 

$$J_{\mathrm{stop}} = \min_{\tau_\ell} \frac{1}{\mathbb{E}[\tau_\ell]} \Big( \mathbb{E}\Big[\sum_{t=0}^{\tau_\ell-1} \bar g(x_{s_\ell+t}, \varsigma_{s_\ell+t}, \phi_\ell)\Big] + \delta \Big) \tag{18}$$

where $\tau_0$ is a stopping time with respect to the filtration corresponding to the information set $\mathcal{H}_t^0$. In other words, the event $[\tau_0 = m]$ is a function of $\mathcal{H}_m^0$. Having defined $\tau_0$ we can similarly define stopping times $\tau_\ell$ with respect to the filtration corresponding to the information set $\mathcal{H}_t^\ell = \{\phi_\ell\} \cup \{y_{s_\ell+1}, y_{s_\ell+2}, \dots, y_{s_\ell+t}\}$ and consider an identical stopping time problem to (18) for a general $\ell$. These problems are identical due to Assumptions 1, 2, 3, as stated next. These stopping times $\tau_\ell$ define the control times according to (3). 

Lemma 1: Suppose that Assumptions 1, 2, 3 hold. Then the optimal stopping time policies for problems (18) are identical in the sense that they take the form 

$$\tau_\ell = \min\{r \geq 1 \mid \sigma_{s_\ell+r} = 1\}, \qquad \sigma_{s_\ell+t} = \xi_t(\mathcal{H}_t^\ell), \quad \ell \in \mathbb{N}_0, \tag{19}$$

for the same functions $\xi_i$, $i \in \{0,1,\dots,h-1\}$. Moreover, (19) is also an optimal policy for the average cost problem of minimizing (7) and $J_{\mathrm{av}} = J_{\mathrm{stop}}$. Furthermore, the optimal policy for problem (18) can be obtained by solving the following stopping time problem 

$$\min_{\tau_0} \mathbb{E}\Big[\sum_{t=0}^{\tau_0-1} \big(\bar g(x_t, \varsigma_t, \phi_0) - \beta\big)\Big] + \delta \tag{20}$$

where $\beta \in \mathbb{R}_{\geq 0}$ is the largest value for which the optimal solution to this new stopping problem (20) results in a zero cost and is given by $\beta = J_{\mathrm{av}}$. 














In the proof of Lemma 1 we show that the optimal cost of (20) is a non-increasing continuous function of $\beta$ and thus has a unique zero (since for $\beta = 0$ it is positive and for very large $\beta$ it is negative). 

The optimal policy and cost for problem (20) for a given $\beta$ can be obtained by the dynamic programming algorithm as shown next. The search for $\beta$ is discussed at the end of the section. Let $q_t = [q_{t,1}\ \dots\ q_{t,n}]^\top$ with $q_{t,i} = \mathrm{Prob}[x_t = i \mid \mathcal{H}_t^0]$ and 

$$g_t = [\bar g(1, t, \phi_0)\ \ \dots\ \ \bar g(n, t, \phi_0)]^\top, \quad t \in \{0,1,\dots,h-1\},$$

where the dependence of $g_t$ on the given $\phi_0$ is omitted, as well as in other functions in the sequel, both for notational convenience and since often $g_t$ does not depend on $\phi_0$. Then, one should iterate for $t \in \{h-1,\dots,0\}$ 

$$J_h(q_h) = \delta,$$
$$J_t(q_t) = \min\{\delta,\ g_t^\top q_t - \beta + \mathbb{E}[J_{t+1}(q_{t+1}) \mid \mathcal{H}_t^0]\}$$

and the optimal policy is 

$$\tau_\ell = \min\{r \in \{1,\dots,h\} \mid \sigma_{s_\ell+r} = 1\},$$

$$\sigma_{s_\ell+t} = \begin{cases} 1, & \text{if } J_t(q_t) = \delta \text{ or if } t = h, \\ 0, & \text{otherwise.} \end{cases}$$


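Note that $q_t$ can be iterated with the Bayes' filter 

$$q_{t+1} = \frac{D(y_{t+1})\, P\, q_t}{a(y_{t+1})}$$

with $P_{ij} = \mathrm{Prob}[x_{t+1} = i \mid x_t = j]$ and 

$$D(y_{t+1}) = \mathrm{diag}([R_{y_{t+1},1}\ \ \dots\ \ R_{y_{t+1},n}])$$

with $R_{\ell j} = \mathrm{Prob}[y_t = \ell \mid x_t = j]$, for $\ell \in \{1,\dots,n_y\}$, $j \in \{1,\dots,n\}$ and 

$$a(\ell) = \mathrm{Prob}[y_{t+1} = \ell \mid \mathcal{H}_t^0] = \sum_{j=1}^{n} R_{\ell,j}\, r_{t,j},$$

with $r_t = [r_{t,1}\ \dots\ r_{t,n}]^\top$, $r_t = P q_t$. The filter is initialized with $q_0 = \bar p_{\phi_0}$ if (5a) holds and $r_0 = \bar p_{\phi_0}$ if (5b) holds. 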
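
One step of this filter can be written in a few lines. The sketch below is a plain-Python illustration with zero-based labels for states and measurements; the function name `bayes_filter_step` and the numerical values (taken from the single-subarea example of Section VI, with index 1 standing for "weed") are choices made for this illustration only.

```python
def bayes_filter_step(q, P, R, y):
    """One Bayes filter update: predict with P, then correct with measurement y.

    q    : current belief over states (list of probabilities)
    P    : P[i][j] = Prob[x_{t+1} = i | x_t = j]
    R    : R[l][j] = Prob[y = l | x = j]
    y    : zero-based index of the received measurement
    Returns the updated belief and the measurement likelihood a(y).
    """
    n = len(q)
    # Prediction step: r = P q
    r = [sum(P[i][j] * q[j] for j in range(n)) for i in range(n)]
    # Correction step: D(y) r, with D(y) = diag(R[y][0], ..., R[y][n-1])
    unnormalized = [R[y][j] * r[j] for j in range(n)]
    a = sum(unnormalized)  # a(y) = sum_j R[y][j] r[j]
    if a == 0.0:
        raise ValueError("measurement has zero probability under the model")
    return [u / a for u in unnormalized], a


P_weed = [[0.9, 0.0],
          [0.1, 1.0]]
R_weed = [[0.8, 0.2],
          [0.2, 0.8]]
q_new, likelihood = bayes_filter_step([1.0, 0.0], P_weed, R_weed, y=1)
```

Starting from a weed-free belief, a single "weed" reading raises the weed probability from 0.1 (prediction only) to 0.08/0.26, which reflects how a sensor with a nonzero error probability only partially resolves the state.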
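It is well known that (see, e.g., [22]) 

$$J_t(q_t) = \min_{c \in \mathcal{J}_t} c^\top q_t.$$

However, the size of the set $\mathcal{J}_t$, denoted by $|\mathcal{J}_t|$, grows as 

$$|\mathcal{J}_t| = 1 + |\mathcal{J}_{t+1}|^{n_y} \tag{21}$$

leading to a computationally intractable method to find the optimal policy. We will briefly show this as the derivation is important in the sequel. Assume that $J_{t+1}(q_{t+1}) = \min_{c \in \mathcal{J}_{t+1}} c^\top q_{t+1}$, which is true for $t = h-1$ as we can write $J_h(q_h) = \delta = c^\top q_h$ with $c = \delta 1_n$, since $1_n^\top q_h = 1$. To obtain an expression for $J_t(q_t)$ we need to compute 

$$\mathbb{E}[J_{t+1}(q_{t+1}) \mid \mathcal{H}_t^0] = \sum_{\ell=1}^{n_y} \mathbb{E}[J_{t+1}(q_{t+1}) \mid y_{t+1} = \ell, \mathcal{H}_t^0]\, a(\ell)$$
$$= \sum_{\ell=1}^{n_y} \min_{c \in \mathcal{J}_{t+1}} c^\top \Big(\frac{D(\ell) P q_t}{a(\ell)}\Big) a(\ell)$$
$$= \sum_{\ell=1}^{n_y} \min_{c \in \mathcal{J}_{t+1}} c^\top \big(D(\ell) P q_t\big).$$

Note that $\delta = d_2^\top q_t$ and $g_t^\top q_t - \beta = d_{1,t}^\top q_t$ for $d_2 = \delta 1_n$ and $d_{1,t} = g_t - \beta 1_n$, so that 

$$J_t(q_t) = \min\Big(d_2^\top q_t,\ d_{1,t}^\top q_t + \sum_{\ell=1}^{n_y} \min_{c \in \mathcal{J}_{t+1}} c^\top \big(D(\ell) P q_t\big)\Big) = \min_{c \in \mathcal{J}_t} c^\top q_t$$

where $\mathcal{J}_t = \bar{\mathcal{J}}_t \cup \{d_2\}$ and 

$$\bar{\mathcal{J}}_t := \Big\{ d_{1,t} + \sum_{\ell=1}^{n_y} P^\top D(\ell)\, c_{j_\ell} \ \Big|\ c_{j_\ell} \in \mathcal{J}_{t+1},\ j_1, \dots, j_{n_y} \in \{1, \dots, |\mathcal{J}_{t+1}|\} \Big\}.$$

Note that indeed (21) holds. 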
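
The backup derived above can be made concrete with a short enumeration. The sketch below builds the set at time $t$ from the set at time $t+1$ without removing duplicates, which makes the growth in (21) directly visible. The function name `backup`, the zero-based indexing and the numerical values (the single-subarea matrices of Section VI with hypothetical $\delta = 1$ and $\beta = 0.3$) are assumptions made for illustration.

```python
from itertools import product


def backup(J_next, d1, d2, P, R):
    """Exact dynamic programming backup on sets of cost vectors.

    J_next : list of vectors c representing J_{t+1}(q) = min_c c^T q
    d1, d2 : vectors d_{1,t} = g_t - beta * 1 and d_2 = delta * 1
    P, R   : transition and measurement matrices (column conventions as in
             P[i][j] = Prob[x' = i | x = j], R[l][j] = Prob[y = l | x = j])
    """
    n, n_y = len(d1), len(R)

    def pt_d_times(l, c):
        # (P^T D(l) c)_j = sum_i P[i][j] * R[l][i] * c[i]
        return [sum(P[i][j] * R[l][i] * c[i] for i in range(n)) for j in range(n)]

    projected = [[pt_d_times(l, c) for c in J_next] for l in range(n_y)]
    J_t = [list(d2)]  # stopping immediately costs delta
    # One candidate per assignment of a vector of J_next to each measurement.
    for choice in product(range(len(J_next)), repeat=n_y):
        vec = list(d1)
        for l, k in enumerate(choice):
            vec = [v + p for v, p in zip(vec, projected[l][k])]
        J_t.append(vec)
    return J_t


P_ex = [[0.9, 0.0], [0.1, 1.0]]
R_ex = [[0.8, 0.2], [0.2, 0.8]]
delta_ex, beta_ex = 1.0, 0.3
d2_ex = [delta_ex, delta_ex]
d1_ex = [0.0 - beta_ex, 1.0 - beta_ex]

sets = [[d2_ex]]  # J_h contains only delta * 1
for _ in range(3):
    sets.append(backup(sets[-1], d1_ex, d2_ex, P_ex, R_ex))
sizes = [len(J) for J in sets]
value_at_clean = min(sum(c_i * q_i for c_i, q_i in zip(c, [1.0, 0.0])) for c in sets[1])
```

With two measurement values the set sizes follow 1, 2, 5, 26, which is the recursion (21); after a few more steps the enumeration is already impractical, which is the intractability the text refers to.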

When $g_t$ depends on $\phi_0 = i$ so will the cost to go, and now we stress this by denoting the cost to go at time $t = 0$ by $J_0^{i,\beta}(q_0)$, where the dependence on $\beta$ is also added. The cost (20) is then equal to $\sum_{i=1}^{b} \alpha_i J_0^{i,\beta}(\bar p_i)$. Thus, one needs to find $\beta$ for which this cost is zero. To this end we can simply run a bisection algorithm as this is a monotone and continuous function of $\beta$. 
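
The bisection on $\beta$ can be sketched on a deliberately simplified instance. The code below assumes $b = 1$, condition (5a), a running cost that does not depend on $t$, and, as a strong simplification that is not the paper's setting, uninformative measurements, so that the stopping time reduces to a fixed number of steps and the cost of (20) has a closed form to check against. All function names and numerical values are illustrative.

```python
def open_loop_stopping_cost(beta, g, P, p_bar, delta, h):
    """Cost of (20) when measurements carry no information (simplifying
    assumption): min over r in {1..h} of sum_{t<r} (g^T P^t p_bar - beta) + delta."""
    n = len(p_bar)
    q = list(p_bar)
    running, best = 0.0, float("inf")
    for _ in range(h):
        running += sum(g[i] * q[i] for i in range(n)) - beta
        best = min(best, running)
        q = [sum(P[i][j] * q[j] for j in range(n)) for i in range(n)]
    return best + delta


def find_beta(cost, lo, hi, tol=1e-10):
    """Bisection for the zero of a continuous non-increasing function."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cost(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


g_ex, delta_ex, h_ex = [0.0, 1.0], 1.0, 10
P_ex = [[0.9, 0.0], [0.1, 1.0]]
p_bar_ex = [1.0, 0.0]

beta_star = find_beta(
    lambda b: open_loop_stopping_cost(b, g_ex, P_ex, p_bar_ex, delta_ex, h_ex),
    lo=0.0,
    hi=max(g_ex) + delta_ex,  # at this value stopping after one step costs <= 0
)

# Closed form for this simplified case: best periodic cost min_r (nu_r + delta) / r.
nu, q, candidates = 0.0, list(p_bar_ex), []
for r in range(1, h_ex + 1):
    nu += sum(gi * qi for gi, qi in zip(g_ex, q))
    candidates.append((nu + delta_ex) / r)
    q = [sum(P_ex[i][j] * q[j] for j in range(2)) for i in range(2)]
beta_closed_form = min(candidates)
```

In this open-loop special case the zero found by bisection coincides with the best periodic average cost, which gives a simple sanity check before plugging in the measurement-dependent cost to go.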


IV. RELAXED DYNAMIC PROGRAMMING POLICY 


The idea of relaxed dynamic programming [22] is to find simpler functions to approximate $J_t(q_t)$. Here we consider the following functions: $V_h(q_h) = \delta$ and 

$$V_t(q_t) = \min_{c \in \mathcal{V}_t} c^\top q_t, \quad t \in \{0,1,\dots,h-1\}, \tag{22}$$

where $\mathcal{V}_t \subseteq \mathcal{J}_t$ can be seen as a pruned version of $\mathcal{J}_t$. We will provide a procedure to obtain this pruned set in such a way that 

$$J_t(q_t) \leq V_t(q_t) \leq J_t(q_t) + \epsilon(h - t), \quad t \in \{0,1,\dots,h-1\}, \tag{23}$$

so that $V_t(q_t)$ is always within an additive factor $\epsilon(h-t)$ of the optimal cost, and the resulting policy as well. Although the original idea of relaxed dynamic programming considered a multiplicative factor for the guarantees, i.e., $J_t(q_t) \leq V_t(q_t) \leq (1+\epsilon) J_t(q_t)$, here an additive factor is chosen for two reasons. First, in the present application $J_t$ and $V_t$ take in general negative values (due to subtracting $\beta$ from the running cost in (20)), thus the multiplicative bound makes no sense. Second, in the present application $J_t(q_t) \leq \delta$ for every $t$ and $q_t$, so that it is easy to avoid scaling issues. 

Towards this, let us define 

$$\bar J_t^\epsilon(q_t) = \min\{\delta + \epsilon,\ (q_t^\top g_t - \beta + \epsilon) + \mathbb{E}[V_{t+1}(q_{t+1}) \mid \mathcal{H}_t^0]\}.$$

Using similar steps to the ones used in the previous section we can conclude that 

$$\bar J_t^\epsilon(q_t) = \min_{c \in \mathcal{J}_t^\epsilon} c^\top q_t$$

where 

$$\mathcal{J}_t^\epsilon = \Big\{ d_{1,t} + \epsilon 1_n + \sum_{\ell=1}^{n_y} P^\top D(\ell)\, c_{j_\ell} \ \Big|\ c_{j_\ell} \in \mathcal{V}_{t+1},\ j_1, \dots, j_{n_y} \in \{1, \dots, |\mathcal{V}_{t+1}|\} \Big\} \cup \{d_2 + \epsilon 1_n\}.$$

Function $\bar J_t^\epsilon$ coincides with $J_t$ when $\epsilon = 0$. However, this function is defined with an extra cost term for the running cost when $\epsilon > 0$. Given a $\tilde c \in \mathcal{J}_t^\epsilon$, consider a corresponding $\tilde c^0$ from the set $\mathcal{J}_t^\epsilon|_{\epsilon=0}$ defined as above. 

At each time $t$, the set $\mathcal{V}_t$ is a pruned version of the set $\mathcal{J}_t$. To facilitate the pruning operation, which is carried out iteratively, it is wise to define a heuristic function $H(c)$ which assigns a score to the elements $c$ of a given set (say $\mathcal{J}_t^\epsilon$). The higher the value of $H(c)$ the larger the belief that $c$ can be pruned. 

Relaxed Dynamic Programming procedure: 


1) Initialize $\mathcal{V}_t$ as empty or with some members $c$ for which $c^\top \xi_i$ is minimal for some $\xi_i$. 

2) Take the element $\tilde c$ in $\mathcal{J}_t^\epsilon \setminus \mathcal{V}_t$ with the smallest $H$ and check if it satisfies 

$$\min_{c \in \mathcal{V}_t} c^\top q \leq \tilde c^\top q, \quad \forall q \in \mathcal{P}_n. \tag{24}$$

3) If (24) is not satisfied, then add $\tilde c^0$ to $\mathcal{V}_t$. If there are no more elements in $\mathcal{J}_t^\epsilon$, then stop, otherwise go to step 2. 


The heuristic used evaluates $c^\top \xi_i$ for each of $n_c$ members $c$, and for $s$ fixed values $\xi_i$, $i \in \{1,\dots,s\}$, and assigns a reward $r_i \in \{1,\dots,n_c\}$ to each member corresponding to the ranking in the $i$th ordered set according to $c^\top \xi_i$. If the ranking of a given $c$ is one for some $i$, it is immediately added according to step 1). Otherwise $H(c) = \sum_{i=1}^{s} r_i$. 

The next result states that indeed (23) is met with this procedure. 

Lemma 2: Let $V_t$ be defined by (22) with the set $\mathcal{V}_t$ obtained from the relaxed dynamic programming procedure. Then (23) holds. 














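The following test provides a sufficient condition to test if (24) holds, and if (24) is replaced by this test the same guarantees can also be given: if there exist $a_i \geq 0$ with $\sum_{i=1}^{|\mathcal{V}_t|} a_i = 1$ such that, with $c_i \in \mathcal{V}_t$, $\sum_{i=1}^{|\mathcal{V}_t|} a_i c_i \leq \tilde c$, then (24) holds. To test this latter condition we can simply test the feasibility of the linear inequalities 

$$A a \leq b \tag{25}$$

with 

$$A = \begin{bmatrix} C \\ -I \\ 1^\top \\ -1^\top \end{bmatrix}, \quad b = \begin{bmatrix} \tilde c \\ 0 \\ 1 \\ -1 \end{bmatrix}, \quad C = [c_1\ \ \dots\ \ c_{|\mathcal{V}_t|}].$$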
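
The pruning loop can be illustrated without a linear programming solver by using an even more conservative test: a candidate is discarded only when a single kept vector is componentwise no larger than its $\epsilon$-shifted version. This corresponds to restricting the weights $a_i$ in the test above to unit vectors, so it prunes less than (25) would. The sketch below is not the authors' implementation; the function name `prune` is hypothetical and the candidates are processed in the order given, standing in for the ordering by the heuristic $H$.

```python
def prune(candidates, eps):
    """Conservative pruning sketch.

    candidates : unshifted cost vectors (the eps = 0 versions), already
                 ordered by some heuristic score
    eps        : additive relaxation; the shifted vector is c + eps * 1
    A candidate is dropped when some kept vector k satisfies k <= c + eps
    componentwise, which implies k^T q <= (c + eps * 1)^T q for every q in
    the probability simplex, i.e. condition (24) holds.
    """
    kept = []
    for c0 in candidates:
        shifted = [v + eps for v in c0]
        dominated = any(
            all(k_i <= s_i for k_i, s_i in zip(k, shifted)) for k in kept
        )
        if not dominated:
            kept.append(list(c0))  # the unshifted vector is the one stored
    return kept


candidates_ex = [[0.0, 1.0], [1.0, 0.0], [0.05, 0.98], [0.5, 0.5]]
kept_relaxed = prune(candidates_ex, eps=0.1)
kept_exact = prune(candidates_ex, eps=0.0)
```

In this toy set the third vector is retained when $\epsilon = 0$ but discarded when $\epsilon = 0.1$, which shows how the relaxation trades a bounded loss in accuracy for a smaller set; replacing the componentwise check by the feasibility test (25) would discard at least as many vectors.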


V. CONSISTENT POLICY 


We start by providing a result stating the performance of 
periodic control. 

Lemma 3: Suppose that Assumptions 1, 2, 3 hold and that 
$\tau_\ell = h$, $\forall \ell \in \mathbb{N}_0$, are constant, corresponding to periodic 
control. Then Ja := Js + Jp with 


b 
1 
Js =, = > Vh ii Ip = z (26) 
where 
{to 
OD gI P*)pi, if (5a) holds 
£=0 
Vh i = i pas 
zl Fa P +g Ppi, if (5b) holds 
£=0 














The proposed policy yields a better trade-off between 
average number of actions and average cost in the sense that 
if we define the curve 


(8, Jper(s)) 





nr)(s—r) ifs €[r,r+1) (27) 


(Rav,per ’ Jay ) 


with hayper = E[to] = 1/Je the average inter-control time 
and Jay the cost of that policy, is below this curve. We call 
this a consistent policy. We propose such a consistent policy 
inspired by [17], [18] but different both in terms of form and 
in terms of derivation. 

The proposed policy is defined as follows: 














sitr 


Ti = min{r € {1,...,h}| XO gf ye > —6 + weiger} (28) 


t=se 


for We; < Wm,i Where 
1 5 
Wm = min{y,; + b-|r E {1,...,h}}. 


Intuitively, assuming for simplicity $b = 1$ and $\ell = 0$, if we knew the state and if we could make sure that 

$$\sum_{t=0}^{\tau_0-1} \bar g(x_t, t, 1) + \delta - w_{c,1}\tau_0 < 0$$

we would obtain also that 

$$\mathbb{E}\Big[\sum_{t=0}^{\tau_0-1} \bar g(x_t, t, 1) + \delta - w_{c,1}\tau_0\Big] < 0,$$

which would imply that 

$$\frac{1}{\mathbb{E}[\tau_0]} \Big( \mathbb{E}\Big[\sum_{t=0}^{\tau_0-1} \bar g(x_t, t, 1)\Big] + \delta \Big) < w_{c,1} < w_{m,1},$$

meaning that we could ensure that such a policy would yield a better trade-off than any periodic policy. Although the state is not available, this policy still ensures this property by replacing the term $\bar g(x_t, t, \phi_0)$ by its expected value given the information up to time $t$, $g_t^\top q_t$, and by taking into account the initial condition. 

Fig. 2: Precision agriculture setting. A strip of potato crops is divided into $N$ subareas. Weed appears and spreads along this strip. A set of $M$ sensors measure with some fault probability whether there is weed or not in a subset of subareas. The proposed policies make it possible to determine when a weed removal intervention is needed. 

A limitation is that if $\delta$ is large and $w_{m,i}$ is small the policy might lead to many control actions. Since $\delta$ can be seen as a parameter that sets the desired trade-off in the Pareto optimal curve it can be set to zero. The only restriction then comes from $w_{m,i}$. The consistency property holds for any choice of $\delta$ and $w_{c,i}$. 

Theorem 1: Suppose that Assumptions 1, 2, 3 hold. Let 
Jr be the cost Jay of the proposed policy (33) and let 7, = 
[Toa] = 1/Je be the average inter-control time. Then, 














Ir Toa as (29) 


That is, the policy is consistent. 














VI. APPLICATION IN PRECISION AGRICULTURE 


Consider the precision farming setting of Section II-B.2 and assume that the field is a strip of potato crops divided in $N$ subareas as depicted in Figure 2. The state $x_t \in \{1,\dots,2^N\}$ labels all possible states of the $N$ weed indicator variables $x_{t,i} \in \{1,2\}$. At initialization and directly after the interventions at times $s_\ell$, the whole field has no weed. Weed can simply appear in a given subarea or spread from one of the neighboring subareas, which is summarized by 

$$\mathrm{Prob}[x_{t+1,l} = 2 \mid x_{t,l-1} = a,\ x_{t,l} = 1,\ x_{t,l+1} = b] = \begin{cases} r_{11}, & \text{if } a = 1,\ b = 1, \\ r_{21}, & \text{if } a = 2,\ b = 1, \\ r_{12}, & \text{if } a = 1,\ b = 2, \\ r_{22}, & \text{if } a = 2,\ b = 2, \end{cases}$$

when $l \notin \{1, N\}$, with $0 < r_{11} < r_{21} = r_{12} < r_{22} < 1$, and $\mathrm{Prob}[x_{t+1,1} = 2 \mid x_{t,2} = i,\ x_{t,1} = 1] = \mathrm{Prob}[x_{t+1,N} = 2 \mid x_{t,N-1} = i,\ x_{t,N} = 1] = r_i$, $i \in \{1,2\}$, with $0 < r_1 = r_{11} < r_2 = r_{21} < 1$. When a subarea has weed, it remains until an intervention, $\mathrm{Prob}[x_{t+1,i} = 2 \mid x_{t,i} = 2] = 1$ for every $i$. From these assumptions (13) can be computed or, equivalently, the transition probability matrix $P_{ij} = \mathrm{Prob}[x_{t+1} = i \mid x_t = j]$ can be computed, as follows. First, let $\pi_j(i) = x_{t,j}$ extract the value of the indicator state $j \in \{1,\dots,N\}$ for a given state $i \in \{1,\dots,2^N\}$ and let $\mathcal{N}(i,t) = (x_{t,i-1}, x_{t,i+1})$, $\mathcal{N}(1,t) = x_{t,2}$, $\mathcal{N}(N,t) = x_{t,N-1}$ be the neighboring indicator states. Then $P_{ij} = 0$ if there exists $\ell$ such that $\pi_\ell(i) = 1$ and $\pi_\ell(j) = 2$ and, otherwise, 

$$P_{ij} = \prod_{\ell \in M_{i,j,2}} r_{\mathcal{N}(\ell,t)} \prod_{\ell \in M_{i,j,1}} \big(1 - r_{\mathcal{N}(\ell,t)}\big)$$

with $M_{i,j,\kappa} = \{\ell \mid \pi_\ell(j) = 1,\ \pi_\ell(i) = \kappa\}$, $\kappa \in \{1,2\}$. 
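
The construction of the transition matrix can be sketched by enumerating all indicator configurations. The function name `weed_transition_matrix`, the zero-based positions and the enumeration order of the states are choices made here for illustration; the symmetric case $r_{12} = r_{21}$ stated above is assumed, and for a single subarea the appearance probability $r_{11}$ is used, as in the single-subarea example considered later in this section.

```python
from itertools import product


def weed_transition_matrix(N, r11, r21, r22):
    """Return (states, P) with P[i][j] = Prob[x_{t+1} = states[i] | x_t = states[j]].

    Each state is a tuple of N indicators in {1, 2} (2 means weed).
    """
    states = list(product((1, 2), repeat=N))

    def appear_prob(state, l):
        # Probability that weed appears in the weed-free subarea l.
        nbrs = [state[k] for k in (l - 1, l + 1) if 0 <= k < N]
        if len(nbrs) < 2:
            # Boundary subarea (or N == 1): only one (or no) neighbor.
            return r21 if 2 in nbrs else r11
        infected = sum(1 for v in nbrs if v == 2)
        return (r11, r21, r22)[infected]

    n = len(states)
    P = [[0.0] * n for _ in range(n)]
    for j, src in enumerate(states):
        for i, dst in enumerate(states):
            prob = 1.0
            for l in range(N):
                if src[l] == 2:
                    if dst[l] == 1:  # weed never disappears without intervention
                        prob = 0.0
                        break
                else:
                    r = appear_prob(src, l)
                    prob *= r if dst[l] == 2 else 1.0 - r
            P[i][j] = prob
    return states, P


states_1, P_1 = weed_transition_matrix(1, r11=0.1, r21=0.2, r22=0.4)
states_3, P_3 = weed_transition_matrix(3, r11=0.02, r21=0.2, r22=0.4)
column_sums_3 = [sum(P_3[i][j] for i in range(len(states_3))) for j in range(len(states_3))]
```

Each column sums to one because, given the current configuration, the subareas evolve independently; the matrix has $2^N \times 2^N$ entries, which is what makes the large example below demanding.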
Each sensor $i \in \{1,\dots,M\}$ can measure several subareas, according to (15), but for simplicity here it is assumed it only measures one subarea $l_i \in \{1,\dots,N\}$. Thus $y_{t,i} \in \{1,2\}$ and an error probability is assumed 

$$\mathrm{Prob}[y_{t,i} = 2 \mid x_{t,l_i} = 1] = \mathrm{Prob}[y_{t,i} = 1 \mid x_{t,l_i} = 2] = e_p.$$

From these assumptions, (15) can be computed or, equivalently, $R_{ij} = \mathrm{Prob}[y_t = i \mid x_t = j]$. Let $\chi_j(i) = y_{t,j}$ extract the value of the subarea measurement $j \in \{1,\dots,M\}$ for a given measurement $i \in \{1,\dots,2^M\}$ and let $s_{ij}$ be the number of subarea measurements associated with $y_t = i$ that provide a correct estimate for the corresponding subarea state associated with $x_t = j$, i.e., $s_{ij} = |\{\ell \mid \chi_\ell(i) = \pi_{l_\ell}(j)\}|$. Then 

$$R_{ij} = (1 - e_p)^{s_{ij}}\, e_p^{M - s_{ij}}.$$
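
The measurement matrix follows the same enumeration pattern. In the sketch below, the function name `measurement_matrix`, the zero-based sensor positions and the enumeration order are illustrative assumptions.

```python
from itertools import product


def measurement_matrix(N, sensor_positions, e_p):
    """Return R with R[i][j] = Prob[y_t = outputs[i] | x_t = states[j]].

    sensor_positions : zero-based subarea index observed by each sensor
    e_p              : probability that a sensor reports the wrong value
    """
    states = list(product((1, 2), repeat=N))
    M = len(sensor_positions)
    outputs = list(product((1, 2), repeat=M))
    R = [[0.0] * len(states) for _ in outputs]
    for i, y in enumerate(outputs):
        for j, x in enumerate(states):
            # s_ij: number of sensors whose reading matches the true subarea state
            correct = sum(1 for k, pos in enumerate(sensor_positions) if y[k] == x[pos])
            R[i][j] = (1.0 - e_p) ** correct * e_p ** (M - correct)
    return R


R_1 = measurement_matrix(1, [0], e_p=0.2)
R_3 = measurement_matrix(3, [0, 2], e_p=0.02)
column_sums_R3 = [sum(R_3[i][j] for i in range(len(R_3))) for j in range(len(R_3[0]))]
```

For one subarea and one sensor this reproduces the symmetric two-by-two matrix with $1 - e_p$ on the diagonal, and in general every column sums to one since the sensor errors are independent.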


The running cost in (16), (17) can be written as the running cost in (18) and boils down to 

$$\bar g(i, t, \phi_0) = d\, [\pi_1(i) - 1\ \ \dots\ \ \pi_N(i) - 1]\, 1_N. \tag{30}$$

We start by considering a very special case with just one field $N = 1$, $x_t \in \{1,2\}$, one measurement $y_t \in \{1,2\}$ and 

$$P = \begin{bmatrix} 1 - r_{11} & 0 \\ r_{11} & 1 \end{bmatrix}, \quad R = \begin{bmatrix} 1 - e_p & e_p \\ e_p & 1 - e_p \end{bmatrix},$$

where $P_{ij} = \mathrm{Prob}[x_{t+1} = i \mid x_t = j]$, $R_{ij} = \mathrm{Prob}[y_t = i \mid x_t = j]$, $i,j \in \{1,2\}$. This simple case still allows one to compute the optimal policy and to understand how far the cost of the consistent policy is from the optimal one for this example. The parameters considered are $h = 10$, $d = 1$, $r_{11} = 0.1$, $e_p = 0.2$. Figure 3 shows the results of periodic control, of the optimal policy (or relaxed dynamic programming with $\epsilon = 0$) considering $\delta \in \{0.5, 1, 1.5, 2, 2.5, 3, 3.5\}$, and of the consistent policy with $\delta = 0$ and 

$$w_{m,1} \in \{0.01, 0.03, 0.05, 0.07, 0.09, 0.11, 0.13, 0.15, 0.17, 0.2, 0.3, 0.4\}.$$

As can be seen, the optimal policy yields a significant reduction of the cost $J_s$, when the running cost is (30), for the same average intervention interval with respect to periodic control. The consistent policy uses the same parameters and provides results close to optimal.

We now consider a more realistic example with $N = 12$ subareas, $M = 3$ sensors $l_1 = 3$, $l_2 = 6$, $l_3 = 9$, still with $d = 1$, and parameters $r_{11} = 0.02$, $r_{21} = 0.2$, $r_{22} = 0.4$, $\epsilon_p = 0.02$, $h = 20$. For this example the number of states is $n = 2^{12} = 4096$. We can no longer apply relaxed dynamic programming since solving (25) with $c \in \mathbb{R}^{4096}$ is computationally hard. However, we can still compute the consistent policy. The results are depicted in Figure 4 considering parameters $w_{m,1} \in \{0.2, 0.4, 0.6, 1, 1.5, 2, 2.5, 3, 3.5\}$ and $\delta = 0$.

Fig. 3: Average (weed) cost $J_s$ versus average rate of control actions $J_c$ for three policies: periodic, optimal and consistent for a simple example with $N = M = 1$

Fig. 4: Average (weed) cost $J_s$ versus average rate of control actions $J_c$ when $N = 12$ for periodic control, consistent with $M = 3$ sensors and full state feedback $M = 12$

VII. CONCLUSION 


We have proposed a new framework to determine when a partially observable Markov decision process with finite state, input and measurement spaces should be sampled and/or intervened upon, assuming these operations are costly. We proposed two approaches to find a policy for the time intervals between interventions based on the data available between these interventions, in an event-triggered control fashion. The approach based on relaxed dynamic programming can provide nearly optimal cost when the state dimension is very small, but it is not suitable for larger problems, as it involves solving linear programs whose dimension increases significantly with the state dimension. In turn, the approach based on consistent control provides a simple and effective solution that remains computable for larger problems, such as the example with $N = 12$ subareas.

VIII. PROOFS
A. Proof of Lemma 1 


The first statement follows directly from Assumptions 1 and 3 and the structure of the model. Due to these assumptions, the process state has the same probability distribution at $s_\ell$ when (5a) holds or at $s_\ell + 1$ when (5b) holds. Moreover, the model can be written in terms of the functions $f$, $g$, and $h$ given in Section II-A, which only depend on the process variables in an epoch $\{s_\ell, s_\ell + 1, \dots, s_{\ell+1}\}$. This allows one to conclude that indeed the stopping time problems (18) are identical and admit a solution that only depends on the information set $\mathcal{H}_t^{\ell}$, as shown in (19).

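To prove the second statement, we consider an arbitrary policy for the stopping times $\tau_\ell$ taking the form (19) but not necessarily optimal. We start by noticing that due to the decomposition of $J_s$ given in (31), we can also decompose $J_{\mathrm{av}}$ as

\[
J_{\mathrm{av}} = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Bigl[\sum_{\ell=0}^{L(T)-1} \theta_\ell + \sum_{t=s_{L(T)}}^{T-1} g(x_t, t - s_{L(T)}, \phi_{s_{L(T)}})\Bigr] \tag{31}
\]

with

\[
\theta_\ell = \sum_{t=s_\ell}^{s_{\ell+1}-1} g(x_t, t - s_\ell, \phi_{s_\ell}) + \delta
\]

independently and identically distributed random variables due to the first part of the present lemma. When taking the limit, the last term vanishes and due to Wald's identity [24, Ch.1]

\[
\mathbb{E}\Bigl[\sum_{\ell=0}^{L(T)-1} \theta_\ell\Bigr] = \mathbb{E}[L(T)]\, \mathbb{E}[\theta_\ell],
\]

where

\[
\mathbb{E}[\theta_\ell] = \mathbb{E}\Bigl[\sum_{t=s_\ell}^{s_\ell+\tau_\ell-1} g(x_t, t - s_\ell, \phi_{s_\ell})\Bigr] + \delta.
\]

Moreover, due to the key renewal theorem [24, Ch.3]

\[
\lim_{T \to \infty} \frac{\mathbb{E}[L(T)]}{T} = \frac{1}{\mathbb{E}[\tau_\ell]}.
\]

From these equations we conclude that for an arbitrary policy for $\tau_\ell$ taking the form (19)

\[
J_{\mathrm{av}} = \frac{1}{\mathbb{E}[\tau_\ell]} \Bigl( \mathbb{E}\Bigl[\sum_{t=s_\ell}^{s_\ell+\tau_\ell-1} g(x_t, t - s_\ell, \phi_{s_\ell})\Bigr] + \delta \Bigr).
\]

Thus, the optimal policy that minimizes the right-hand side is also an optimal policy for $J_{\mathrm{av}}$.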

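In order to prove the third statement of the present Lemma, note that if the $J_{\mathrm{av}}$ cost (18) of the optimal policy is lower bounded by $\beta$, $J_{\mathrm{av}}^{*} \ge \beta$, then it must be the case that, for any policy for $\tau_\ell$,

\[
\frac{1}{\mathbb{E}[\tau_\ell]} \Bigl( \mathbb{E}\Bigl[\sum_{t=s_\ell}^{s_\ell+\tau_\ell-1} g(x_t, t - s_\ell, \phi_{s_\ell})\Bigr] + \delta \Bigr) \ge \beta,
\quad \text{that is,} \quad
\mathbb{E}\Bigl[\sum_{t=s_\ell}^{s_\ell+\tau_\ell-1} g(x_t, t - s_\ell, \phi_{s_\ell})\Bigr] + \delta \ge \beta\, \mathbb{E}[\tau_\ell],
\]

or equivalently that

\[
\mathbb{E}\Bigl[\sum_{t=s_\ell}^{s_\ell+\tau_\ell-1} \bigl(g(x_t, t - s_\ell, \phi_{s_\ell}) - \beta\bigr)\Bigr] + \delta \ge 0
\]

for any policy. Conversely, if for some $\beta$ we can find a policy for which

\[
\mathbb{E}\Bigl[\sum_{t=s_\ell}^{s_\ell+\tau_\ell-1} \bigl(g(x_t, t - s_\ell, \phi_{s_\ell}) - \beta\bigr)\Bigr] + \delta < 0
\]

then $J_{\mathrm{av}} < \beta$. Since the number of policies for $\tau_\ell$ is finite, say $P$, we can write the optimal cost as the minimum over a finite number of continuous functions of $\beta$ (for fixed policy)

\[
\mathbb{E}\Bigl[\sum_{t=s_\ell}^{s_\ell+\tau_\ell-1} \bigl(g(x_t, t - s_\ell, \phi_{s_\ell}) - \beta\bigr)\Bigr] + \delta
\]

and thus the cost is a continuous function of $\beta$. Moreover, it is clear that it is non-increasing with $\beta$ and that it is positive when $\beta = 0$ and negative for very large $\beta$. The value of $\beta$ that leads to zero cost is thus equal to $J_{\mathrm{av}} = \beta$ and the corresponding optimal policy is an optimal policy for the original problem, concluding the proof.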
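The monotonicity argument suggests a simple numerical procedure: bisect on $\beta$ until the minimum over policies of the shifted epoch cost crosses zero. The sketch below is only an illustration under strong assumptions: each policy is summarized by a hypothetical pair (expected epoch running cost, expected epoch length), whereas in the paper these expectations come from the stopping time problem (18).

```python
def optimal_average_cost(policies, delta, tol=1e-10):
    """Bisect on beta for the root of f(beta) = min_p (c_p - beta * tau_p) + delta.

    policies: list of (expected_epoch_cost c_p, expected_epoch_length tau_p),
    with c_p >= 0 and tau_p > 0 (hypothetical summaries of each policy).
    """
    def f(beta):
        return min(c - beta * tau for c, tau in policies) + delta

    # f is continuous and non-increasing, f(0) >= 0, and f <= 0 at the largest ratio.
    lo, hi = 0.0, max((c + delta) / tau for c, tau in policies)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With a finite table of policies the root coincides with the smallest ratio $(c_p + \delta)/\tau_p$, which can be used as a check of the bisection.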
B. Proof of Lemma 2 
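Using induction, we have $J_h(q_h) = V_h(q_h) = \delta$ and assuming

\[
J_{t+1}(q_{t+1}) \le V_{t+1}(q_{t+1}) \le J_{t+1}(q_{t+1}) + \epsilon\,(h - (t+1)),
\]

then

\[
\begin{aligned}
\bar{J}_t(q_t) &= \min\bigl\{\delta + \epsilon,\ (q_t^{\top} g - \beta + \epsilon) + \mathbb{E}[V_{t+1}(q_{t+1}) \mid \mathcal{H}_t^{\ell}]\bigr\} \\
&\le \min\bigl\{\delta + \epsilon,\ (q_t^{\top} g - \beta + \epsilon) + \mathbb{E}[J_{t+1}(q_{t+1}) + \epsilon\,(h - (t+1)) \mid \mathcal{H}_t^{\ell}]\bigr\} \\
&\le \min\bigl\{\delta + \epsilon\,(h - t),\ (q_t^{\top} g - \beta) + \mathbb{E}[J_{t+1}(q_{t+1}) \mid \mathcal{H}_t^{\ell}] + \epsilon\,(h - t)\bigr\} \\
&= \epsilon\,(h - t) + \min\bigl\{\delta,\ (q_t^{\top} g - \beta) + \mathbb{E}[J_{t+1}(q_{t+1}) \mid \mathcal{H}_t^{\ell}]\bigr\} \\
&= J_t(q_t) + \epsilon\,(h - t).
\end{aligned}
\]

The relaxed dynamic programming algorithm ensures through (24) that

\[
V_t(q_t) = \min_{c}\, c^{\top} q_t \le \bar{J}_t(q_t)
\]

from which we conclude the desired inequality

\[
V_t(q_t) \le J_t(q_t) + \epsilon\,(h - t).
\]

(The inequality $J_t(q_t) \le V_t(q_t)$ is clear.)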


C. Proof of Lemma 3 


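Let us first assume that (5a) holds. From the proof of Lemma 1 we conclude that for any policy for the stopping times $\tau_\ell$ and for an arbitrary $\ell \in \mathbb{N}_0$

\[
J_{\mathrm{av}} = \frac{1}{\mathbb{E}[\tau_\ell]} \Bigl( \mathbb{E}\Bigl[\sum_{t=s_\ell}^{s_\ell+\tau_\ell-1} g(x_t, t - s_\ell, \phi_{s_\ell})\Bigr] + \delta \Bigr)
\]

which specialized to $\tau_\ell = h$ and $\ell = 0$ ($s_\ell = 0$) leads to

\[
J_{\mathrm{av}} = \frac{1}{h} \Bigl( \mathbb{E}\Bigl[\sum_{t=0}^{h-1} g(x_t, t, \phi_0)\Bigr] + \delta \Bigr). \tag{32}
\]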





Note that

\[
\mathbb{E}\Bigl[\sum_{t=0}^{h-1} g(x_t, t, \phi_0)\Bigr] = \sum_{i=1}^{b} \mathbb{E}\Bigl[\sum_{t=0}^{h-1} g(x_t, t, i)\Bigr].
\]
Due to (5a), and since the probability distribution of $x_t$ (when measurements $y_t$ are unknown) is simply $P^t p_i$, we obtain

\[
\mathbb{E}[g(x_t, t, i)] = g^{\top} P^t p_i.
\]

Replacing these expressions in (32) we obtain $J_{\mathrm{av}} = J_s + \delta \frac{1}{h}$ with $J_s$ given by (26) as desired.
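For a single reset distribution, the periodic cost under (5a) can be evaluated directly by propagating the state distribution over one epoch. The sketch below is an illustration only: the function name is hypothetical, and the reset distribution used in the usage note (no weed right after an intervention) is an assumption, not a value stated in this section.

```python
import numpy as np


def periodic_average_cost(P, g, p_reset, h, delta):
    """Evaluate (1/h) * (sum_{t=0}^{h-1} g^T P^t p_reset + delta).

    P is column-stochastic, g is the running-cost vector and p_reset is the
    (assumed) state distribution right after a control action, as in (5a).
    """
    dist = np.asarray(p_reset, dtype=float)
    total = 0.0
    for _ in range(h):
        total += float(g @ dist)   # expected running cost at this step
        dist = P @ dist            # free evolution between control actions
    return (total + delta) / h
```

For the single-subarea example with $r_{11} = 0.1$, $d = 1$ and $h = 10$, one can call `periodic_average_cost(np.array([[0.9, 0.0], [0.1, 1.0]]), np.array([0.0, 1.0]), [1.0, 0.0], 10, 1.0)`; here the probability of weed at step $t$ is $1 - 0.9^t$, which gives a closed form to compare against.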

Let us now assume that (5b) holds. In the first epoch $\{s_0, s_0 + 1, \dots, s_1 - 1\}$ the cost will be slightly different than in the subsequent epochs since the distribution of $x_0 = x_{s_0}$ is different from those of $x_{s_\ell}$, which are all the same for $\ell \ge 1$ due to the reset at times $s_\ell + 1$ imposed by (5b). Since we are interested in an average cost this plays no role and we can simply compute (32) considering the epoch $\{s_1, s_1 + 1, \dots, s_2 - 1\}$ with $s_1 = h$ and $s_2 = 2h$, i.e.,

\[
J_{\mathrm{av}} = \frac{1}{h} \Bigl( \mathbb{E}\Bigl[\sum_{t=h}^{2h-1} g(x_t, t - h, \phi_{s_1})\Bigr] + \delta \Bigr).
\]

Since the distribution of $x_1$ is known to be $p_i$ with $i = \phi_{s_1}$, the distribution of $x_h$ is $P^{h-1} p_i$. The distribution of $x_{h+1}$ is $p_i$ and that of $x_{h+1+r}$, $r \in \{1, 2, \dots, h-1\}$, is $P^r p_i$. Thus, in this case

\[
\mathbb{E}[g(x_{h+t}, t, i)] = g^{\top} P^{t-1} p_i, \quad t \in \{1, \dots, h-1\},
\]

and

\[
\mathbb{E}[g(x_h, 0, i)] = g^{\top} P^{h-1} p_i.
\]

These $h$ terms add up to $\sum_{k=0}^{h-1} g^{\top} P^{k} p_i$. Replacing these expressions in (32) we obtain $J_{\mathrm{av}} = J_s + \delta \frac{1}{h}$ with $J_s$ given by (26) as desired.

D. Proof of Theorem 1 


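Due to Lemma 1 it suffices to consider $\ell = 0$ and for simplicity we consider $b = 1$. The proposed policy in this interval can be rewritten as follows:

\[
\tau_0 = \min\Bigl\{ r \in \{1, \dots, h\} \ \Bigm|\ \sum_{t=0}^{r} \mathbb{E}[g(x_t, t, 1) \mid \mathcal{H}_t^{0}] > -\delta + w_\ell\, r \Bigr\}. \tag{33}
\]

This ensures that

\[
\sum_{t=0}^{\tau_0 - 1} \mathbb{E}[g(x_t, t, 1) \mid \mathcal{H}_t^{0}] + \delta - w_\ell\, \tau_0 \le 0.
\]

Thus

\[
\mathbb{E}\Bigl[\sum_{t=0}^{\tau_0 - 1} \mathbb{E}[g(x_t, t, 1) \mid \mathcal{H}_t^{0}] + \delta - w_\ell\, \tau_0\Bigr] \le 0.
\]

We will prove that

\[
\mathbb{E}\Bigl[\sum_{t=0}^{\tau_0 - 1} \mathbb{E}[g(x_t, t, 1) \mid \mathcal{H}_t^{0}]\Bigr] = \mathbb{E}\Bigl[\sum_{t=0}^{\tau_0 - 1} g(x_t, t, 1)\Bigr] \tag{34}
\]

so that the inequality above implies

\[
\frac{1}{\mathbb{E}[\tau_0]}\, \mathbb{E}\Bigl[\sum_{t=0}^{\tau_0 - 1} g(x_t, t, 1)\Bigr] + \frac{\delta}{\mathbb{E}[\tau_0]} \le w_\ell \le w_{m,1}
\]

leading to the desired consistency conclusion.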
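It suffices to establish (34). To this effect we start by noticing that

\[
\eta_j := \sum_{t=0}^{j-1} \bigl( \mathbb{E}[g(x_t, t, 1) \mid \mathcal{H}_t^{0}] - g(x_t, t, 1) \bigr),
\]

with $\eta_0 = 0$, is a martingale with respect to the filtration $\mathcal{H}_j^{0}$ since

\[
\mathbb{E}[\eta_{j+1} \mid \mathcal{H}_j^{0}] = \eta_j + \mathbb{E}\bigl[ \mathbb{E}[g(x_j, j, 1) \mid \mathcal{H}_j^{0}] - g(x_j, j, 1) \bigm| \mathcal{H}_j^{0} \bigr] = \eta_j.
\]

Due to the optional sampling theorem [25, Sec. 12.4, Th. 11] [19, ] (which can be applied since $\tau_0 \le h$)

\[
\mathbb{E}[\eta_{\tau_0}] = 0
\]

so that

\[
\mathbb{E}\Bigl[\sum_{t=0}^{\tau_0 - 1} \mathbb{E}[g(x_t, t, 1) \mid \mathcal{H}_t^{0}] - \Bigl(\sum_{t=0}^{\tau_0 - 1} g(x_t, t, 1)\Bigr)\Bigr] = 0
\]

which implies (34).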
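Identity (34) can be checked exactly on a small instance by enumerating all state and measurement paths. The sketch below uses the single-subarea model with a short horizon. Several ingredients are assumptions made only for illustration: the information available at time $t$ is taken to be $y_0, \dots, y_t$, the reset distribution is "no weed", and the trigger rule is a hypothetical stand-in that is not claimed to coincide with (33). The only property the check relies on is that the decision to stop at time $t$ depends on the measurements up to $t$.

```python
import itertools

import numpy as np

P = np.array([[0.9, 0.0], [0.1, 1.0]])  # P[i, j] = Prob[x_{t+1} = i | x_t = j]
R = np.array([[0.8, 0.2], [0.2, 0.8]])  # R[i, j] = Prob[y_t = i | x_t = j]
g = np.array([0.0, 1.0])                # cost d = 1 while weed is present
p0 = np.array([1.0, 0.0])               # assumed reset distribution: no weed
h, delta, w = 6, 0.0, 0.3               # hypothetical horizon and thresholds


def stop_and_costs(ys):
    """Return (tau, [E[g(x_t) | y_0..y_t] for t < tau]) along a measurement path."""
    prior, costs = p0, []
    for t in range(h):
        post = R[ys[t]] * prior
        post = post / post.sum()        # filtered belief given y_0..y_t
        costs.append(float(g @ post))
        if t >= 1 and sum(costs) > -delta + w * t:
            return t, costs[:t]         # trigger uses y_0..y_t only
        prior = P @ post
    return h, costs


def both_sides():
    """Exact expectations of the two sides of the identity by enumeration."""
    lhs = rhs = 0.0
    for xs in itertools.product((0, 1), repeat=h):
        for ys in itertools.product((0, 1), repeat=h):
            prob = p0[xs[0]]
            for t in range(h):
                prob *= R[ys[t], xs[t]]
                if t + 1 < h:
                    prob *= P[xs[t + 1], xs[t]]
            if prob == 0.0:
                continue
            tau, costs = stop_and_costs(ys)
            lhs += prob * sum(costs)                          # conditional costs
            rhs += prob * sum(g[xs[t]] for t in range(tau))   # realized costs
    return lhs, rhs
```

Calling `both_sides()` returns the two expectations, which should agree up to floating-point rounding for any trigger rule with the measurability property above.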


